1. INTRODUCTION

Corpora of phrase-structure-annotated text, or treebanks, are useful
for supervised training of statistical models for natural language
processing, as well as for corpus linguistics. Their primary drawback,
however, is that they are very time-consuming to produce. To
alleviate this problem, the standard approach is to make two passes
over the text: first, parse the text automatically, then correct the
parser output by hand.

In  this  paper  we  explore  three  questions: 

•  How  much  does  an  automatic  first  pass  speed  up  annotation? 

•  Does  this  automatic  first  pass  affect  the  reliability  of  the  final 
product? 

•  What  kind  of  parser  is  best  suited  for  such  an  automatic  first 
pass? 

We  investigate  these  questions  by  an  experiment  to  augment  the 
Penn  Chinese  Treebank  [15]  using  a  statistical  parser  developed 
by  Chiang  [3]  for  English.  This  experiment  differs  from  previous 
efforts  in  two  ways:  first,  we  quantify  the  increase  in  annotation 
speed  provided  by  the  automatic  first  pass  (70-100%);  second,  we 
use  a  parser  developed  on  one  language  to  augment  a  corpus  in  an 
unrelated  language. 

2.  THE  PARSER 

The parsing model described by Chiang [3] is based on stochastic
TAG [13, 14]. In this model a parse tree is built up out of tree
fragments (called elementary trees), each of which contains exactly
one lexical item (its anchor).

In the variant of TAG used here, there are three kinds of
elementary trees: initial, (predicative) auxiliary, and modifier, and
three corresponding composition operations: substitution, adjunction,
and sister-adjunction. Figure 1 illustrates all three of these
operations. The first two come from standard TAG [8]; the third is
borrowed from D-tree grammar [11].

In a stochastic TAG derivation, each elementary tree is generated
with a certain probability which depends on the elementary
tree itself as well as the node it gets attached to. Since every tree is
lexicalized, each of these probabilities involves a bilexical
dependency, as in many recent statistical parsing models [9, 2, 4].

Since  the  number  of  parameters  of  a  stochastic  TAG  is  quite  high, 
we  do  two  things  to  make  parameter  estimation  easier.  First,  we 
generate  an  elementary  tree  in  two  steps:  the  unlexicalized  tree, 
then  a  lexical  anchor.  Second,  we  smooth  the  probability  estimates 
of  these  two  steps  by  backing  off  to  reduced  contexts. 
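The backing-off step can be illustrated with a minimal sketch. It assumes a
simple linear interpolation between a full-context estimate and a
reduced-context estimate with a fixed weight; the class name, the weight
`lam`, and the toy tree names are hypothetical, and the smoothing actually
used in [3] may differ in its details.

```python
from collections import Counter


class BackoffEstimator:
    """Interpolates a full-context relative frequency with a reduced-context one."""

    def __init__(self, reduce_context, lam=0.7):
        self.reduce_context = reduce_context  # maps a full context to a reduced one
        self.lam = lam                        # fixed interpolation weight (assumption)
        self.full = Counter()
        self.full_ctx = Counter()
        self.reduced = Counter()
        self.reduced_ctx = Counter()

    def observe(self, outcome, context):
        """Record one training event in both the full and the reduced context."""
        r = self.reduce_context(context)
        self.full[(outcome, context)] += 1
        self.full_ctx[context] += 1
        self.reduced[(outcome, r)] += 1
        self.reduced_ctx[r] += 1

    def prob(self, outcome, context):
        r = self.reduce_context(context)
        p_red = 0.0
        if self.reduced_ctx[r]:
            p_red = self.reduced[(outcome, r)] / self.reduced_ctx[r]
        if not self.full_ctx[context]:
            return p_red  # unseen full context: rely on the reduced context alone
        p_full = self.full[(outcome, context)] / self.full_ctx[context]
        return self.lam * p_full + (1 - self.lam) * p_red


# Toy events: context = (attachment node label, head word); the reduced
# context drops the head word. Outcomes are made-up unlexicalized tree names.
est = BackoffEstimator(lambda ctx: ctx[:1])
est.observe("alpha_NP", ("NP", "dog"))
est.observe("alpha_NP", ("NP", "cat"))
est.observe("beta_mod", ("NP", "cat"))
```

Here an attachment context whose head word was never seen in training still
receives a usable estimate from the reduced context, which is the purpose of
backing off.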

When trained on about 80,000 words of the Penn Chinese Treebank
and tested on about 10,000 words of unseen text, this model
obtains 73.9% labeled precision and 72.2% labeled recall [1].


3.  METHODOLOGY 

For  the  present  experiment  the  parsing  model  was  trained  on 
the  entire  treebank  (99,720  words).  We  then  prepared  a  new  set 
of  20,202  segmented,  POS-tagged  words  of  Xinhua  newswire  text, 
which  was  blindly  divided  into  3  sets  of  equal  size  (±10  words). 

Each set was then annotated in two or three passes, as summarized
by the following table:


Set   Pass 1           Pass 2        Pass 3
1     —                Annotator A   Annotators A&B
2     parser           Annotator A   Annotators A&B
3     revised parser   Annotator A   Annotators A&B


Here “Annotators A&B” means that Annotator B checked the
work of Annotator A, then for each point of disagreement, both
annotators worked together to arrive at a consensus structure. “Parser”
is Chiang’s parser, adapted to parse Chinese text as described by
Bikel and Chiang [1].

“Revised parser” is the same parser with additional modifications
suggested by Annotator A after correcting Set 2. These revisions
primarily resulted from a difference between the artificial evaluation
metric used by Bikel and Chiang [1] and this real-world task.
The metric used earlier, following common practice, did not take
punctuation or empty elements into account, whereas the present
task ideally requires that they be present and correctly placed. Thus
the following changes were made:

• The parser was originally trained on data with the punctuation
marks moved, and did not bother to move the punctuation
marks back. For Set 3 we simply removed the preprocessing
phase which moved the punctuation marks.

• Similarly, the parser was trained on data which had all empty
elements removed. In this case we simply applied a rule-based
postprocessor which inserted null relative pronouns.

• Finally, the parser often produced an NP (or VP) which
dominated only a single NP (respectively, VP), whereas such a
structure is not specified by the bracketing guidelines. Therefore
we applied another rule-based postprocessor to remove
these nodes. (This modification would have helped the original
evaluation as well.)

Figure 1: Grammar and derivation for “John should leave tomorrow.”
α1 and α2 are initial trees, β is a (predicative) auxiliary tree,
γ is a modifier tree.
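The last of these postprocessing steps, removing an NP (or VP) that
dominates only a single NP (respectively, VP), can be sketched as follows.
This is an illustration under stated assumptions, not the postprocessor
actually used: the tree encoding (a word is a string, a constituent is a
`(label, children)` pair) and the function name are hypothetical.

```python
def collapse_same_label_unary(tree, labels=("NP", "VP")):
    """Remove redundant X-over-X nodes for the given labels.

    A tree is either a word (str) or a pair (label, [children]).
    """
    if isinstance(tree, str):
        return tree
    label, children = tree
    children = [collapse_same_label_unary(child, labels) for child in children]
    # While this node has exactly one child and that child is a constituent
    # with the same label, splice the child's children into this node.
    while (
        label in labels
        and len(children) == 1
        and not isinstance(children[0], str)
        and children[0][0] == label
    ):
        children = children[0][1]
    return (label, children)


# Toy example with placeholder words: an NP dominating only another NP.
before = ("IP", [("NP", [("NP", [("NN", ["w1"])])]),
                 ("VP", [("VV", ["w2"])])])
after = collapse_same_label_unary(before)
```

Because the removed node and the remaining node carry the same label, it
makes no difference to the result whether the parent or the child is the one
considered deleted.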

In  short,  none  of  the  modifications  required  major  changes  to  the 
parser,  but  they  did  improve  annotation  speed  significantly,  as  we 
will  see  below. 


4.  RESULTS 


The annotation times and rates for Pass 2 are as follows:

Set   Pass 1           Time (Pass 2)   Rate (Pass 2)
                       (hours:min)     (words/hour)
1     —                28:01           240
2     parser           16:21           412
3     revised parser   14:06           478

The rate for Set 2 was about 70% higher than for Set 1; the rate for
Set 3 was about double that for Set 1 (an increase of about 100%).
Thus the time saved by the use of an automatic first pass is substantial.
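The reported rates and speedups can be reproduced from the annotation times
with a few lines of arithmetic. The sketch below assumes each set contains
exactly one third of the 20,202 words (the sets are stated to be equal only
to within ±10 words), so the results are approximate; the function and
variable names are illustrative.

```python
def words_per_hour(words, hhmm):
    """Convert a word count and an 'hours:min' duration into words per hour."""
    hours, minutes = map(int, hhmm.split(":"))
    return words / (hours + minutes / 60)


# Assumption: three exactly equal sets of 20,202 / 3 = 6,734 words.
WORDS_PER_SET = 20202 / 3

pass2_times = {1: "28:01", 2: "16:21", 3: "14:06"}
rates = {s: words_per_hour(WORDS_PER_SET, t) for s, t in pass2_times.items()}

# Relative increase in annotation rate over the unassisted Set 1.
increase = {s: rates[s] / rates[1] - 1 for s in (2, 3)}

for s in (1, 2, 3):
    print(f"Set {s}: {rates[s]:.0f} words/hour")
for s in (2, 3):
    print(f"Set {s} vs. Set 1: +{increase[s]:.0%}")
```

Under this assumption the computed rates round to the tabulated 240, 412,
and 478 words/hour, and the increases come out at roughly 71% and 99%.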

Assessing  the  reliability  of  the  final  product  is  somewhat  trickier. 


Set   Pass 1           Accuracy (Pass 1)   Accuracy (Pass 2)
                       LP       LR         LP       LR
1     —                —        —          99.84    99.76
2     parser           76.73    75.36      99.76    99.65
3     revised parser   82.87    81.42      99.81    99.26

where  LP  stands  for  labeled  precision  and  LR  stands  for  labeled 
recall.  The  third  column  reports  the  accuracy  of  Pass  1  (the  parser) 
using  the  results  of  Pass  2  (Annotator  A)  as  a  gold  standard.  The 
fourth  column  reports  the  accuracy  of  Pass  2  (Annotator  A)  using 
the  results  of  Pass  3  (Annotators  A&B)  as  a  gold  standard. 
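For readers unfamiliar with these measures, the following sketch computes
labeled precision and recall in the usual bracket-matching way: each
constituent is a (label, start, end) triple, and a test bracket counts as
correct if an identical triple occurs in the gold standard. The exact scoring
conventions used for the table (for instance, how punctuation and empty
elements are counted) are not spelled out here, so this should be read as the
standard definition rather than the precise scorer; all names are
illustrative.

```python
from collections import Counter


def labeled_precision_recall(test_brackets, gold_brackets):
    """Return (precision, recall) over labeled brackets (label, start, end)."""
    test, gold = Counter(test_brackets), Counter(gold_brackets)
    # Multiset intersection: a bracket is matched at most as often as it
    # appears in both the test parse and the gold parse.
    matched = sum((test & gold).values())
    precision = matched / sum(test.values()) if test else 0.0
    recall = matched / sum(gold.values()) if gold else 0.0
    return precision, recall


# Toy five-word sentence: the test parse gets three of its five brackets right.
gold = [("IP", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("NP", 3, 5)]
test = [("IP", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("NP", 2, 5), ("ADVP", 2, 3)]
lp, lr = labeled_precision_recall(test, gold)
```

In the table, the "test" side is the parser output (third column) or
Annotator A's work (fourth column), and the "gold" side is the result of the
following pass.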

We  note  several  points: 

•  There  is  no  indication  that  the  addition  of  an  automatic  first 
pass  affected  the  accuracy  of  Pass  2.  On  the  other  hand,  the 
near-perfect  reported  accuracy  of  Pass  2  suggests  that  in  fact 
each  pass  biased  subsequent  passes  substantially.  We  need 
a  more  objective  measure  of  reliability,  which  we  leave  for 
future  experiments. 


• The parser revisions significantly improved the accuracy of
the parser with respect to the present metric (which is sensitive
to punctuation and empty elements). On Set 2 the revised
parser obtained 78.98/77.39% labeled precision/recall, an
error reduction of about 9%.

• Not surprisingly, errors due to large-scale structural
ambiguities were the most time-consuming to correct by hand. To
take an extreme example, one parse produced by the parser is
shown in Figure 2. It often matches the correct parse (shown
in Figure 3) at the lowest levels but the large-scale errors
require the annotator to make many corrections.
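The error reduction quoted for the revised parser on Set 2 can be checked
with a short calculation. The sketch assumes the baseline is the original
parser's Set 2 accuracy from the table (76.73/75.36), which is our reading of
the comparison; the function name is illustrative.

```python
def relative_error_reduction(old_accuracy, new_accuracy):
    """Fraction of the old error (100 - accuracy) eliminated by the new system."""
    old_error = 100.0 - old_accuracy
    new_error = 100.0 - new_accuracy
    return (old_error - new_error) / old_error


# Set 2: original parser vs. revised parser, labeled precision and recall.
lp_reduction = relative_error_reduction(76.73, 78.98)
lr_reduction = relative_error_reduction(75.36, 77.39)
print(f"LP error reduction: {lp_reduction:.1%}")
print(f"LR error reduction: {lr_reduction:.1%}")
```

The two values fall on either side of 9%, consistent with the rounded figure
given above.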

5.  DISCUSSION 

In summary, although Chiang’s parser was not specifically
designed for Chinese, and was trained on only a moderate amount of
data (less than 100,000 words), the parses it provided were reliable
enough that the annotation rate was effectively doubled.

Now  we  turn  to  our  third  question:  what  kind  of  parser  is  most 
suitable  for  an  automatic  first  pass?  Marcus  et  al.  [10]  describe  the 
use  of  the  deterministic  parser  Fidditch  [6]  as  an  automatic  first 
pass  for  the  Penn  (English)  Treebank.  They  cite  two  features  of  this 
parser  as  strengths: 

1. It only produces a single parse per sentence, so that the
annotator does not have to search through many parses.

2. It produces reliable partial parses, and leaves uncertain
structures unspecified.

The Penn-Helsinki Parsed Corpus of Middle English was
constructed using a statistical parser developed by Collins [4] as an
automatic first pass. This parser, as well as Chiang’s, retains the
first advantage but not the second. However, we suggest two ways
a statistical parser might be used to speed annotation further:

First, the parser can be made more useful to the annotator. A
statistical parser typically produces a single parse, but can also
(with little additional computation) produce multiple parses.
Ratnaparkhi [12] has found that choosing (by oracle) the best parse out
of the 20 highest-ranked parses boosts labeled recall and precision
from about 87% to about 93%. This suggests that if the annotator
had access to several of the highest-ranked parses, he or she could
save time by choosing the parse with the best gross structure and
making small-scale corrections.

Figure 2: Parser output. Translation: “These businesses also transfer
and spread the intellectual property rights of 36,000 technologies
to other businesses and organizations, creating an income of 4.43
billion RMB.”

Figure 3: Corrected parse for sentence of Figure 2.

Would  such  a  change  defeat  the  first  advantage  above  by  forcing 
the  annotator  to  search  through  multiple  parses?  No,  because  the 
parses  produced  by  a  statistical  parser  are  ranked.  The  additional 
lower-ranked  parses  can  only  be  of  benefit  to  the  annotator.  Indeed, 
because  the  chart  contains  information  about  the  certainty  of  each 
subparse,  a  statistical  parser  might  regain  the  second  advantage  as 
well,  provided  this  information  can  be  suitably  presented. 
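The oracle experiment mentioned above can be sketched as picking, from a
ranked n-best list, the candidate that scores highest against the gold parse.
The sketch assumes bracket F1 as the selection criterion and breaks ties in
favor of the higher-ranked parse; both choices, and all names, are
assumptions made for illustration, since the criterion used in [12] is not
described here.

```python
from collections import Counter


def bracket_f1(test_brackets, gold_brackets):
    """Harmonic mean of labeled precision and recall over (label, start, end)."""
    test, gold = Counter(test_brackets), Counter(gold_brackets)
    matched = sum((test & gold).values())
    if matched == 0:
        return 0.0
    precision = matched / sum(test.values())
    recall = matched / sum(gold.values())
    return 2 * precision * recall / (precision + recall)


def oracle_choice(nbest, gold_brackets):
    """Return (rank, parse) of the best candidate; rank 0 is the parser's top parse."""
    return max(
        enumerate(nbest),
        key=lambda item: (bracket_f1(item[1], gold_brackets), -item[0]),
    )


# Toy four-word sentence: the top-ranked parse has the wrong gross structure,
# while the second-ranked parse matches the gold parse.
gold = [("IP", 0, 4), ("NP", 0, 1), ("VP", 1, 4), ("NP", 2, 4)]
nbest = [
    [("IP", 0, 4), ("NP", 0, 2), ("VP", 2, 4)],
    [("IP", 0, 4), ("NP", 0, 1), ("VP", 1, 4), ("NP", 2, 4)],
]
rank, best = oracle_choice(nbest, gold)
```

In an annotation setting there is no gold parse to consult; the annotator
plays the role of the oracle by judging which candidate has the best gross
structure.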

Second, the annotator can be made more useful to the parser by
means of active learning or sample selection [5, 7]. (We are
assuming now that the parser and annotator will take turns in a
train-parse-correct cycle, as opposed to a simple two-pass scheme.) The
idea behind sample selection is that some sentences are more
informative for training a statistical model than others; therefore, if
we have some way of automatically guessing which sentences are
more informative, these sentences are the ones we should
hand-correct first. Thus the parser’s accuracy will increase more quickly,
potentially requiring the annotator to make fewer corrections overall.
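One possible way of guessing which sentences are more informative is to
prefer those on which the parser is least certain. The sketch below uses the
entropy of the (renormalized) n-best parse probabilities as that uncertainty
measure. This criterion is an assumption chosen for illustration; no
particular selection function is specified above, and the names and toy
probabilities are hypothetical.

```python
import math


def parse_entropy(parse_probs):
    """Entropy (in bits) of the n-best parse probabilities after renormalizing."""
    total = sum(parse_probs)
    if total == 0:
        return 0.0
    probs = [p / total for p in parse_probs if p > 0]
    return -sum(p * math.log2(p) for p in probs)


def select_for_annotation(candidates, batch_size):
    """Pick the sentences whose parse distribution is most uncertain.

    candidates maps a sentence id to the list of its n-best parse probabilities.
    """
    ranked = sorted(
        candidates,
        key=lambda sid: parse_entropy(candidates[sid]),
        reverse=True,
    )
    return ranked[:batch_size]


# Toy pool of unannotated sentences with made-up n-best probabilities.
pool = {
    "s1": [0.90, 0.05, 0.05],  # parser fairly confident
    "s2": [0.40, 0.30, 0.30],  # parser very uncertain
    "s3": [1.00],              # only one parse
}
batch = select_for_annotation(pool, batch_size=2)
```

In a train-parse-correct cycle, the selected batch would be hand-corrected,
added to the training data, and the parser retrained before the next
selection round.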